{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas怎样对每个分组应用apply函数?\n",
    "\n",
    "#### 知识：Pandas的GroupBy遵从split、apply、combine模式\n",
    "\n",
    "<div style=\"text-align:left; width:700px;\"><img src=\"./other_files/pandas-split-apply-combine.png\" style=\"\"/></div>\n",
    "\n",
    "这里的split指的是pandas的groupby，我们自己实现apply函数，apply返回的结果由pandas进行combine得到结果\n",
    "\n",
    "#### GroupBy.apply(function)  \n",
    "* function的第一个参数是dataframe\n",
    "* function的返回结果，可是dataframe、series、单个值，甚至和输入dataframe完全没关系\n",
    "\n",
    "#### 本次实例演示：\n",
    "1. 怎样对数值列按分组的归一化？\n",
    "2. 怎样取每个分组的TOPN数据？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实例1：怎样对数值列按分组的归一化？\n",
    "\n",
    "将不同范围的数值列进行归一化，映射到[0,1]区间：\n",
    "* 更容易做数据横向对比，比如价格字段是几百到几千，增幅字段是0到100\n",
    "* 机器学习模型学的更快性能更好\n",
    "\n",
    "归一化的公式：\n",
    "<div style=\"text-align:left; width:500px;\"><img src=\"./other_files/Normalization-Formula.jpg\" style=\"\"/></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 演示：用户对电影评分的归一化\n",
    "\n",
    "每个用户的评分不同，有的乐观派评分高，有的悲观派评分低，按用户做归一化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>UserID</th>\n",
       "      <th>MovieID</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1193</td>\n",
       "      <td>5</td>\n",
       "      <td>978300760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>661</td>\n",
       "      <td>3</td>\n",
       "      <td>978302109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>914</td>\n",
       "      <td>3</td>\n",
       "      <td>978301968</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3408</td>\n",
       "      <td>4</td>\n",
       "      <td>978300275</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2355</td>\n",
       "      <td>5</td>\n",
       "      <td>978824291</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   UserID  MovieID  Rating  Timestamp\n",
       "0       1     1193       5  978300760\n",
       "1       1      661       3  978302109\n",
       "2       1      914       3  978301968\n",
       "3       1     3408       4  978300275\n",
       "4       1     2355       5  978824291"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings = pd.read_csv(\n",
    "    \"./datas/movielens-1m/ratings.dat\", \n",
    "    sep=\"::\",\n",
    "    engine='python', \n",
    "    names=\"UserID::MovieID::Rating::Timestamp\".split(\"::\")\n",
    ")\n",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实现按照用户ID分组，然后对其中一列归一化\n",
    "def ratings_norm(df):\n",
    "    \"\"\"\n",
    "    @param df：每个用户分组的dataframe\n",
    "    \"\"\"\n",
    "    min_value = df[\"Rating\"].min()\n",
    "    max_value = df[\"Rating\"].max()\n",
    "    df[\"Rating_norm\"] = df[\"Rating\"].apply(\n",
    "        lambda x: (x-min_value)/(max_value-min_value))\n",
    "    return df\n",
    "\n",
    "ratings = ratings.groupby(\"UserID\").apply(ratings_norm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>UserID</th>\n",
       "      <th>MovieID</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Timestamp</th>\n",
       "      <th>Rating_norm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1193</td>\n",
       "      <td>5</td>\n",
       "      <td>978300760</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>661</td>\n",
       "      <td>3</td>\n",
       "      <td>978302109</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>914</td>\n",
       "      <td>3</td>\n",
       "      <td>978301968</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3408</td>\n",
       "      <td>4</td>\n",
       "      <td>978300275</td>\n",
       "      <td>0.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2355</td>\n",
       "      <td>5</td>\n",
       "      <td>978824291</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   UserID  MovieID  Rating  Timestamp  Rating_norm\n",
       "0       1     1193       5  978300760          1.0\n",
       "1       1      661       3  978302109          0.0\n",
       "2       1      914       3  978301968          0.0\n",
       "3       1     3408       4  978300275          0.5\n",
       "4       1     2355       5  978824291          1.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings[ratings[\"UserID\"]==1].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到UserID==1这个用户，Rating==3是他的最低分，是个乐观派，我们归一化到0分；"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实例2：怎样取每个分组的TOPN数据？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取2018年每个月温度最高的2天数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ymd</th>\n",
       "      <th>bWendu</th>\n",
       "      <th>yWendu</th>\n",
       "      <th>tianqi</th>\n",
       "      <th>fengxiang</th>\n",
       "      <th>fengli</th>\n",
       "      <th>aqi</th>\n",
       "      <th>aqiInfo</th>\n",
       "      <th>aqiLevel</th>\n",
       "      <th>month</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>2018-01-01</td>\n",
       "      <td>3</td>\n",
       "      <td>-6</td>\n",
       "      <td>晴~多云</td>\n",
       "      <td>东北风</td>\n",
       "      <td>1-2级</td>\n",
       "      <td>59</td>\n",
       "      <td>良</td>\n",
       "      <td>2</td>\n",
       "      <td>2018-01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2018-01-02</td>\n",
       "      <td>2</td>\n",
       "      <td>-5</td>\n",
       "      <td>阴~多云</td>\n",
       "      <td>东北风</td>\n",
       "      <td>1-2级</td>\n",
       "      <td>49</td>\n",
       "      <td>优</td>\n",
       "      <td>1</td>\n",
       "      <td>2018-01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>2018-01-03</td>\n",
       "      <td>2</td>\n",
       "      <td>-5</td>\n",
       "      <td>多云</td>\n",
       "      <td>北风</td>\n",
       "      <td>1-2级</td>\n",
       "      <td>28</td>\n",
       "      <td>优</td>\n",
       "      <td>1</td>\n",
       "      <td>2018-01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>2018-01-04</td>\n",
       "      <td>0</td>\n",
       "      <td>-8</td>\n",
       "      <td>阴</td>\n",
       "      <td>东北风</td>\n",
       "      <td>1-2级</td>\n",
       "      <td>28</td>\n",
       "      <td>优</td>\n",
       "      <td>1</td>\n",
       "      <td>2018-01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>2018-01-05</td>\n",
       "      <td>3</td>\n",
       "      <td>-6</td>\n",
       "      <td>多云~晴</td>\n",
       "      <td>西北风</td>\n",
       "      <td>1-2级</td>\n",
       "      <td>50</td>\n",
       "      <td>优</td>\n",
       "      <td>1</td>\n",
       "      <td>2018-01</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          ymd  bWendu  yWendu tianqi fengxiang fengli  aqi aqiInfo  aqiLevel  \\\n",
       "0  2018-01-01       3      -6   晴~多云       东北风   1-2级   59       良         2   \n",
       "1  2018-01-02       2      -5   阴~多云       东北风   1-2级   49       优         1   \n",
       "2  2018-01-03       2      -5     多云        北风   1-2级   28       优         1   \n",
       "3  2018-01-04       0      -8      阴       东北风   1-2级   28       优         1   \n",
       "4  2018-01-05       3      -6   多云~晴       西北风   1-2级   50       优         1   \n",
       "\n",
       "     month  \n",
       "0  2018-01  \n",
       "1  2018-01  \n",
       "2  2018-01  \n",
       "3  2018-01  \n",
       "4  2018-01  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fpath = \"./datas/beijing_tianqi/beijing_tianqi_2018.csv\"\n",
    "df = pd.read_csv(fpath)\n",
    "# 替换掉温度的后缀℃\n",
    "df.loc[:, \"bWendu\"] = df[\"bWendu\"].str.replace(\"℃\", \"\").astype('int32')\n",
    "df.loc[:, \"yWendu\"] = df[\"yWendu\"].str.replace(\"℃\", \"\").astype('int32')\n",
    "# 新增一列为月份\n",
    "df['month'] = df['ymd'].str[:7]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>ymd</th>\n",
       "      <th>bWendu</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>month</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>2018-01</td>\n",
       "      <td>18</td>\n",
       "      <td>2018-01-19</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2018-02</td>\n",
       "      <td>56</td>\n",
       "      <td>2018-02-26</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2018-03</td>\n",
       "      <td>85</td>\n",
       "      <td>2018-03-27</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2018-04</td>\n",
       "      <td>118</td>\n",
       "      <td>2018-04-29</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2018-05</td>\n",
       "      <td>150</td>\n",
       "      <td>2018-05-31</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    ymd  bWendu\n",
       "month                          \n",
       "2018-01 18   2018-01-19       7\n",
       "2018-02 56   2018-02-26      12\n",
       "2018-03 85   2018-03-27      27\n",
       "2018-04 118  2018-04-29      30\n",
       "2018-05 150  2018-05-31      35"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def getWenduTopN(df, topn):\n",
    "    \"\"\"\n",
    "    这里的df，是每个月份分组group的df\n",
    "    \"\"\"\n",
    "    return df.sort_values(by=\"bWendu\")[[\"ymd\", \"bWendu\"]][-topn:]\n",
    "\n",
    "df.groupby(\"month\").apply(getWenduTopN, topn=1).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们看到，grouby的apply函数返回的dataframe，其实和原来的dataframe其实可以完全不一样"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
